# Intro
This project analyzes a sample of 1599 red wines, each described by 11 variables covering its chemical properties. At least 3 wine experts graded each wine’s quality, giving a score between 0 (worst) and 10 (perfect). The leading question is which chemical properties affect the quality of red wine.
Starting from the data frame, I first take an overview of all variables and then explore the relationships and correlations between them. Along the way, I address the problems this exploration uncovers.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
The first column is named ‘X’. It has no bearing on this analysis, so I drop it.
# Remove the 'X' column using a logical mask over the column names
myvar <- names(redwine) %in% c('X')
redwine <- redwine[!myvar]
colnames(redwine)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
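The same column drop can be written more compactly. The sketch below is only an illustration on a hypothetical toy data frame (`toy` is not part of the original analysis; its values are the first three rows shown in the output further down), and it shows two base R idioms that should give the same result as the logical mask above.

```r
# Hypothetical toy data frame standing in for the wine data
toy <- data.frame(X = 1:3,
                  alcohol = c(9.4, 9.8, 9.8),
                  quality = c(5L, 5L, 5L))

# Option 1: assigning NULL removes the column
toy1 <- toy
toy1$X <- NULL

# Option 2: keep every column whose name is not 'X'
toy2 <- toy[, setdiff(names(toy), "X"), drop = FALSE]

colnames(toy1)
colnames(toy2)
```

Both options leave only `alcohol` and `quality`; which one to use is a matter of taste.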
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The counts show that quality ranges from 3 to 8 and that 5 is the most common score. Scores of 5 and 6 cover the vast majority of observations (1319 of 1599); high-quality wines scoring 7 or 8 number just over 200 (217), and low-quality wines scoring 3 or 4 number fewer than 100 (63).
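The band totals can be reproduced directly from the quality table. The following is a small sketch that re-enters the six counts by hand (the band labels and variable names are my own, not from the original analysis) and also recovers the mean quality reported in the summary.

```r
# Counts copied from the quality table above
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
scores <- as.integer(names(quality_counts))

# Group the scores into three bands: (2,4], (4,6], (6,8]
band <- cut(scores, breaks = c(2, 4, 6, 8),
            labels = c("low (3-4)", "medium (5-6)", "high (7-8)"))
band_counts <- tapply(quality_counts, band, sum)
band_counts

# Share of each band in percent
round(100 * band_counts / sum(quality_counts), 1)

# Mean quality recovered from the counts (the summary reports 5.636)
mean_quality <- weighted.mean(scores, quality_counts)
mean_quality
```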
From the description of the attributes, I noticed that all features can be divided into 4 groups: an acid group, a substance group, a chemical group and a measure group. Next, I create plots to display the characteristics of each group.
# Count the observations whose value equals 0 in
# fixed.acidity, volatile.acidity, and citric.acid
table(redwine$fixed.acidity == 0)
##
## FALSE
## 1599
table(redwine$volatile.acidity == 0)
##
## FALSE
## 1599
table(redwine$citric.acid == 0)
##
## FALSE TRUE
## 1467 132
As the figure shows, the distributions of fixed acidity and volatile acidity are both right skewed, though each is roughly bell shaped. It is also clear that 132 samples (about 8% of the 1599) have a citric acid value of 0.
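The three `table()` calls can be collapsed into a single step. This is a sketch on a hypothetical four-row data frame (the values are the first four rows printed above), not on the full dataset, so the counts differ from the 132 reported for the real data.

```r
# Toy data: first four rows of the acid variables shown above
toy <- data.frame(fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
                  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
                  citric.acid      = c(0.00, 0.00, 0.04, 0.56))

# toy == 0 is a logical matrix; colSums() counts the TRUEs per column
zero_counts <- colSums(toy == 0)
zero_counts
```

On these four rows only `citric.acid` contains zeros, which mirrors the pattern in the full table.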
## Warning: Removed 91 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 48 rows containing non-finite values (stat_bin).
The distributions of residual sugar and chlorides are clearly right skewed, so I narrowed the x-axis limits, guided by the mean values in the summary above, to look more closely at the bulk of these two histograms.
After this adjustment, the histograms of residual sugar and chlorides both look approximately normal, and alcohol shows a similar shape that remains right skewed.
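The warnings about removed rows are consistent with observations falling outside the narrowed axis limits. As a rough illustration, the sketch below uses synthetic right-skewed data (not the wine data) and an assumed cut at the 95th percentile to count how many values such a limit would exclude.

```r
# Synthetic right-skewed values; NOT the wine data
set.seed(1)
x <- rlnorm(1000, meanlog = 1, sdlog = 0.5)

# Assumed upper axis limit: the 95th percentile
upper <- quantile(x, 0.95)

# Number of observations a histogram limited to [0, upper] would leave out
n_dropped <- sum(x > upper)
n_dropped

# Right skew shows up as a mean above the median
c(mean = mean(x), median = median(x))
```

In ggplot2, `coord_cartesian(xlim = ...)` zooms the view without discarding observations, which can be preferable when the dropped rows would otherwise change the bin counts.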
These three plots all contain some outliers. After excluding these outliers, the third plot in particular looks like a normal distribution; however, the original distributions in all three plots are right skewed.
[Figures: two histograms with x-axis labels "Density" and "pH"]
These two figures show that density and pH are both approximately normally distributed, with no apparent outliers.
The summary of the data frame ‘redwine’ shows that it contains 1599 observations. The raw file has 13 columns: the column ‘X’ plus the 12 variables listed in the column names above.
The main feature of this dataset is quality, and I am interested in which variables affect quality and how. I have categorized the variables into 4 groups, and I suspect there may be correlations between the groups, so I first explore the relationships between pairs of variables. After that, I investigate further with multiple variables.
I think density and pH might be factors that influence quality. The variables in the acid group should affect pH, in the same way that the variables in the substance group should affect density. So the variables in these two groups will support my analysis of the feature of interest.
The column ‘X’ was already dropped above, so it does not need renaming. I also want to create two variables based on the description of each variable:

1. total.acidity, the sum of fixed acidity and volatile acidity;
2. bound.sulfur.dioxide, the difference between total sulfur dioxide and free sulfur dioxide.
# Create two new columns: total.acidity and bound.sulfur.dioxide
redwine$total.acidity <- redwine$fixed.acidity + redwine$volatile.acidity
redwine$bound.sulfur.dioxide <-
  redwine$total.sulfur.dioxide - redwine$free.sulfur.dioxide
# Keep a backup copy of the dataset so it can be restored later
redwine2 <- redwine
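As a quick sanity check of the two derived columns, here is a sketch on a three-row toy data frame built from the first rows printed earlier. It uses `transform()` as an alternative to the `$<-` assignments above; that choice is mine, not the original author's.

```r
# Toy data: first three rows of the relevant columns shown earlier
toy <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8),
  volatile.acidity     = c(0.70, 0.88, 0.76),
  free.sulfur.dioxide  = c(11, 25, 15),
  total.sulfur.dioxide = c(34, 67, 54)
)

# Derive both columns in one call
toy <- transform(
  toy,
  total.acidity        = fixed.acidity + volatile.acidity,
  bound.sulfur.dioxide = total.sulfur.dioxide - free.sulfur.dioxide
)

# Bound sulfur dioxide should never be negative if the inputs are consistent
stopifnot(all(toy$bound.sulfur.dioxide >= 0))
toy[, c("total.acidity", "bound.sulfur.dioxide")]
```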
After creating a histogram for each variable, I noticed that the distribution of citric acid is unusual compared with the others: it has many 0 values, which means many red wine samples contain none of this component. Most of the other variables have right-skewed distributions, and some are approximately normal. The only tidying I did was to drop the first column, because the data frame is otherwise tidy and clean.
First of all, I plot every variable against quality to look for the strongest positive and negative correlations.
From these plots, we can see that fixed.acidity and total.acidity show no obvious trend as quality changes, but volatile.acidity decreases as quality rises, and citric.acid is positively correlated with quality.
As the plots show, residual sugar and chlorides always stay at low levels, and neither shows an apparent correlation with quality, whereas alcohol tends to increase from low to high quality. Chlorides may also have a faint negative correlation with quality.
Sulphates has a clear positive correlation with quality and, within a specific range, both bound sulfur dioxide and total sulfur dioxide decrease as quality increases, which indicates a negative correlation.
The first plot shows the negative correlation between density and red wine quality, and the second also displays a decreasing trend as quality rises.
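Trends like these can also be checked numerically by summarizing each variable per quality level. The sketch below uses only the first ten rows printed in the `str()` output earlier, so it illustrates the mechanics rather than reproducing the trends; ten rows are far too few to draw conclusions.

```r
# First ten rows of three columns, copied from the str() output
first10 <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Median of each variable within each quality level
med <- aggregate(cbind(alcohol, volatile.acidity) ~ quality,
                 data = first10, FUN = median)
med
```

Applied to the full data frame, the same call would give one row per quality score from 3 to 8.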
These plots suggest that the variables with the clearest positive correlation with quality are:
- citric.acid
- sulphates
- alcohol

The variables with the clearest negative correlation with quality are:
- volatile.acidity
- chlorides
- bound.sulfur.dioxide
- total.sulfur.dioxide
- density

Next, I compute the correlation coefficient for each of the variables listed above:
# First restore the backup copy, because the quality variable has
# been changed
redwine <- redwine2
# Compute the correlation coefficients for the variables that appear
# positively correlated with quality
cor.test(redwine$citric.acid, redwine$quality, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: redwine$citric.acid and redwine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
cor.test(redwine$sulphates, redwine$quality, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: redwine$sulphates and redwine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
cor.test(redwine$alcohol, redwine$quality, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: redwine$alcohol and redwine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
# Compute the correlation coefficients for the variables that appear
# negatively correlated with quality
cor.test(redwine$volatile.acidity, redwine$quality, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: redwine$volatile.acidity and redwine$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
cor.test(redwine$chlorides, redwine$quality, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: redwine$chlorides and redwine$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
cor.test(redwine$bound.sulfur.dioxide, redwine$quality, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: redwine$bound.sulfur.dioxide and redwine$quality
## t = -8.3898, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2519465 -0.1580336
## sample estimates:
## cor
## -0.205463
cor.test(redwine$total.sulfur.dioxide, redwine$quality, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: redwine$total.sulfur.dioxide and redwine$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
cor.test(redwine$density, redwine$quality, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: redwine$density and redwine$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
From the results above, the strongest positive and negative correlations stand out. The largest correlation coefficient is 0.4761663, for alcohol, and the most negative is -0.3905578, for volatile.acidity.
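The comparison above can be made explicit by collecting the reported coefficients in one vector. The sketch below copies the numbers from the `cor.test()` output by hand (the vector names are my own) and also recomputes each t statistic from r and the degrees of freedom, using the standard relation t = r * sqrt(df) / sqrt(1 - r^2), as a consistency check.

```r
# Pearson coefficients and t statistics copied from the cor.test() output above
r <- c(citric.acid = 0.2263725, sulphates = 0.2513971, alcohol = 0.4761663,
       volatile.acidity = -0.3905578, chlorides = -0.1289066,
       bound.sulfur.dioxide = -0.205463, total.sulfur.dioxide = -0.1851003,
       density = -0.1749192)
t_reported <- c(9.2875, 10.38, 21.639, -16.954, -5.1948, -8.3898, -7.5271, -7.0997)
deg_free <- 1597

# Variables with the most positive and most negative coefficient
names(which.max(r))
names(which.min(r))

# Rank by strength regardless of sign
r[order(-abs(r))]

# Recompute t from r and the degrees of freedom
t_recomputed <- r * sqrt(deg_free) / sqrt(1 - r^2)
round(cbind(r, t_reported, t_recomputed), 3)
```

Ranking by absolute value puts alcohol first and volatile.acidity second, with chlorides the weakest of the eight.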
Sometimes two variables with no obvious logical connection turn out to be related, which is a good opportunity to explore them further. So I use ‘ggpairs’ to look for surprising correlations.
# ggpairs() comes from the GGally package
library(GGally)
ggpairs(redwine)
The plot shows that most of the large correlation coefficients occur between closely related variables, like citric.acid and total.acidity; however, I still find some unexpected relationships:
- fixed.acidity and density
- alcohol and density

Compared with the relationship between alcohol and density, the correlation coefficient between residual sugar and density is below 0.5, smaller than I expected. The effect of the sulfur dioxide variables on pH is also weaker than I expected, because I thought these compounds would lower the pH of red wine.
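Reading coefficients off a large pairs plot is error prone, so it can help to list the pairs in a table sorted by strength. The following is a minimal base R sketch on synthetic data (the columns `x`, `y` and `z` are invented for illustration); applied to the wine data frame, the same steps would rank all variable pairs.

```r
# Synthetic data: y is nearly a copy of x, z is unrelated noise
set.seed(42)
n <- 200
x <- rnorm(n)
toy <- data.frame(x = x, y = x + rnorm(n, sd = 0.1), z = rnorm(n))

# Full correlation matrix
r <- cor(toy)

# Blank out the diagonal and upper triangle so each pair appears once
r[upper.tri(r, diag = TRUE)] <- NA

# Convert to long format and sort by absolute correlation
pairs_tbl <- na.omit(as.data.frame(as.table(r)))
names(pairs_tbl) <- c("var1", "var2", "cor")
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$cor)), ]
pairs_tbl
```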
The strongest relationship is between fixed.acidity and total.acidity, with a correlation coefficient of 0.995. This means that volatile acidity makes up only a small share of total acidity.
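A rough figure for that share can be taken from the means in the summary table. This sketch uses only those two reported means, so it describes the average wine; the correlation itself also depends on how much each component varies, which the means alone do not capture.

```r
# Means copied from the summary output above
mean_fixed    <- 8.32
mean_volatile <- 0.5278

# Average share of volatile acidity in total acidity
share <- mean_volatile / (mean_fixed + mean_volatile)
round(100 * share, 1)
```

On average, volatile acidity accounts for roughly 6% of total acidity as defined here.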
Within the four groups defined earlier, some pairs of variables in the same group affect red wine quality together: citric acid with volatile acidity, total.sulfur.dioxide with sulphates, and chlorides with alcohol.
This figure shows that more citric acid and less volatile acidity tend to go with higher-quality red wine. Although the smooth lines give a clear trend for the points in each color, they are still only fitted estimates with error.
Apart from a few outliers, almost all high-quality red wines have relatively high sulphates and relatively low total sulfur dioxide. In other words, when sulfur dioxide is low, the more sulphates a red wine has, the higher its quality tends to be.
This plot is similar to the last one: high alcohol and low chlorides are associated with high-quality red wines. It makes sense that nobody would like to taste salty red wine, and the alcohol content is relatively higher in high-quality wines than in low-quality ones.
There are also some relationships between variables that are not in the same group, like chlorides with sulphates, and fixed.acidity with residual.sugar.
This figure differs from the others in that it covers the case where both variables are high. It shows that when chlorides and sulphates are both high, the quality of red wine tends to be quite low. High-quality wines appear only at high sulphates and low chlorides.
From the plot, it is hard to identify a pattern linking these two variables to quality. It looks as if, with residual sugar held constant, quality increases as fixed acidity rises until fixed acidity reaches about 9, then drops at once to the lowest quality, and afterwards rises slightly for a short while. Overall, residual sugar stays at a very low level in red wine.
I am surprised that fixed.acidity has the strongest positive correlation with density, and that the correlation coefficients between chlorides and the other variables are small on average, although chlorides has a relatively close relationship with sulphates.
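# mtable() comes from the memisc package
library(memisc)
# Build a linear model relating alcohol and quality
# (as written, alcohol is the response and quality the predictor)
m1 <- lm(I(alcohol) ~ I(quality), data = redwine)
# Add volatile.acidity to the model
m2 <- update(m1, ~ . + volatile.acidity)
mtable(m1, m2)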
##
## Calls:
## m1: lm(formula = I(alcohol) ~ I(quality), data = redwine)
## m2: lm(formula = I(alcohol) ~ I(quality) + volatile.acidity, data = redwine)
##
## ================================================
## m1 m2
## ------------------------------------------------
## (Intercept) 6.882*** 6.998***
## (0.165) (0.220)
## I(quality) 0.628*** 0.618***
## (0.029) (0.032)
## volatile.acidity -0.115
## (0.142)
## ------------------------------------------------
## R-squared 0.227 0.227
## adj. R-squared 0.226 0.226
## sigma 0.937 0.937
## F 468.267 234.406
## p 0.000 0.000
## Log-likelihood -2164.504 -2164.179
## Deviance 1403.295 1402.725
## AIC 4335.007 4336.358
## BIC 4351.139 4357.866
## N 1599 1599
## ================================================
In this linear model the R-squared is only 0.227, so the fit is poor. Note that, as written, the model has alcohol as the response and quality as the predictor. The model is therefore not useful for my analysis, and it cannot capture the relationship between quality and the variables most correlated with it.
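The R-squared of m1 is tied directly to the Pearson correlation reported earlier: for a regression with a single predictor, R-squared equals the squared correlation, whichever variable is the response. The sketch below demonstrates this on synthetic data (the vectors `x` and `y` are invented) and then squares the alcohol coefficient from the `cor.test()` output.

```r
# Synthetic data for the demonstration; NOT the wine data
set.seed(7)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# R-squared with y as response, and with the roles swapped
r2_yx <- summary(lm(y ~ x))$r.squared
r2_xy <- summary(lm(x ~ y))$r.squared

# Both equal the squared Pearson correlation
all.equal(r2_yx, cor(x, y)^2)
all.equal(r2_yx, r2_xy)

# Squared alcohol-quality correlation from the cor.test() output
r2_alcohol <- 0.4761663^2
round(r2_alcohol, 3)
```

So the 0.227 in the model table carries no information beyond the correlation of 0.476, and swapping response and predictor in a one-variable model would not change it.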
This is the modified version of the first plot in this EDA. I chose it because quality is the main feature I explore here, so I need to understand it fully before moving to the next step. The figure makes clear that most wines score 5 or 6. Because the distribution of quality is roughly normal, the dataset is suitable for exploring the relationship between quality and the other variables.
This figure is the most successful of all the plots. I swapped the x and y axes relative to the original because the linear relationship between the two variables is clearer that way. It shows that each quality level has its own negatively sloped regression line with a different intercept, and quality 8 has the largest intercept, which means high-quality red wine has high citric acid and low volatile acidity.
Alcohol is the variable most positively correlated with red wine quality in this dataset. This modified figure shows that the highest-quality red wines fall in the range from 11% to 13% alcohol and contain quite low levels of chlorides. As the amount of chlorides increases, the quality of red wine drops gradually.
This was a long and complicated project for me because it was the first time I had to set the direction of the exploration myself. Although the dataset is not large, with just 1599 observations and 13 variables, I found that a few outliers still interfered with my plots and models.
Exploring on my own gave some insight into the strongest correlations between quality and the other variables. The largest positive correlation coefficients are 0.476 for alcohol and 0.251 for sulphates; the largest negative one is -0.391 for volatile.acidity. Even though these variables are closely related to quality, many other variables also affect red wine quality, which should be examined with more complex models or analysis.
While exploring, I noticed that some variables belong to the same category, so I first divided them into 4 groups. I found that analyzing pairs of variables within the same group first, and then pairs from different groups, made the whole process more reliable. The two-variable explorations are specific and easy to understand: the trend in each plot is clear, and the correlation coefficient of each pair is easy to identify.
There were also some results I did not expect, such as the negative correlation between volatile acidity and pH and the quite weak correlation between residual sugar and density.
In the future, it would be better to add more variables, such as the ingredients of the red wine and the ambient temperature during production. It would also help to add text descriptions for some variables, such as their units and notes on the effect of low or high values.